Consider the random Dirichlet partition of the interval into $n$ fragments with parameter $\theta > 0$. We recall the unordered Ewens 
sampling formulae from finite Dirichlet partitions. As this is a key variable for estimation purposes, focus is on the number 
of distinct visited species in the sampling process. These are illustrated in specific cases. We use these preliminary statistical 
results on frequencies distribution to address the following sampling problem: what is the estimated number of species when 
sampling is from Dirichlet populations? The obtained results are in accordance with the ones found in sampling theory from 
random proportions with Poisson-Dirichlet distribution. To conclude with, we apply the different estimators suggested to two 
different sets of real data. 

1 Introduction 

The Dirichlet partition of an interval can be viewed as a generalization of some classical models in ecological statistics. 
For example, on the one hand, when $\theta = 1$, the Dirichlet partition corresponds to the broken-stick model (see 
Feller (1966), pages 22-24), one of the most famous stochastic models of relative species abundance, studied by 
MacArthur (1957) (see also Tokeshi (1993) for an exhaustive survey on species abundance models). On the other 
hand, when $\theta$ goes to infinity, the Dirichlet partition is deterministic and uniform, and when $\theta$ goes to 0, jointly with 
the number of fragments going to infinity, the ordered version of the Dirichlet partition identifies with the Poisson-Dirichlet 
(PD) partition and corresponds to Fisher's log-series model. These relationships between the models 
cited above were already pointed out early by Simpson (1949) (but the term "Dirichlet distribution" was coined by 
Wilks (1962) many years later). 

The organization of this manuscript is the following. In Section 2 we recall the Ewens sampling formulae when 
sampling is from finite Dirichlet partitions. Consider the random Dirichlet partition of the interval into $n$ fragments 
with parameter $\theta > 0$. Elementary properties of its $D_n(\theta)$ distribution are first recalled in Subsection 2.1. Subsection 
2.2 describes some motivational sampling problems from Dirichlet proportions. Some generalities about sampling 
from Dirichlet partitions are then proved in Subsection 2.3. Subsection 2.4 is devoted to the Ewens sampling formulae 
when sampling is from the Dirichlet partition $D_n(\theta)$. Here the order in which sequentially sampled species arise 
is irrelevant. Similarly, we recall the second Ewens sampling formula under the same hypothesis (as a problem of random 
partitioning of the integers). As corollaries to these results, assuming $n \uparrow \infty$, $\theta \downarrow 0$ while $n\theta = \gamma > 0$, the usual 
well-known sampling formulae will be deduced in each case when sampling is from the $PD(\gamma)$ distribution. These 
general sampling formulae are also illustrated in detail in two particular cases: the Bose-Einstein case (when $\theta = 1$) 
and the Maxwell-Boltzmann case (when $\theta$ tends to infinity). As this is the key variable for estimation purposes, 
focus is also put on the number of distinct visited species in the sampling process, in each case. 

Section 3 concerns the statistical problem of estimating the number of distinct species in a Dirichlet population. 
A maximum-likelihood estimator is developed, which is derived from the sampling formulae recalled in the previous 
section. We also recall the minimum variance one suggested by Keener et al. (1987). For some particular classes 
of Dirichlet partitions, we supply simpler expressions for these estimators. We study some related statistical questions 
such as stopping rules in the sampling process and goodness of fit. In the last subsection, we explore the difficult 
problem of estimating jointly $n$ and $\theta$. 

At last, Section 4 is devoted to applications to real data. The first two data sets concern word usage by two authors 
(Keener et al., 1987) while the second two data sets deal with tropical beetle species (Janzen, 1973). 

2 Sampling from Dirichlet proportions 

First we recall basic properties about the symmetric Dirichlet distributions. Second we give some motivation about 
sampling with this distribution. Then we recall sampling formulae that will be useful later. 

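2.1 Dirichlet partition of the interval 

Consider the following random partition into $n$ fragments of the unit interval. Let $\theta > 0$ be some parameter and 
assume that the random fragment sizes $\mathbf{S}_n = (S_1, \ldots, S_n)$ (with $\sum_{m=1}^n S_m = 1$) are exchangeable and distributed 
according to the (symmetric) Dirichlet $D_n(\theta)$ p.d.f. which is defined on the simplex, i.e. 

$$f_{S_1,\ldots,S_n}(s_1,\ldots,s_n) = \frac{\Gamma(n\theta)}{\Gamma(\theta)^n} \prod_{m=1}^{n} s_m^{\theta-1} \cdot \delta_{\left(\sum_{m=1}^n s_m - 1\right)} . \qquad (1)$$

Alternatively the distribution of $\mathbf{S}_n = (S_1, \ldots, S_n)$ can also be characterized by its joint moment function 

$$\forall (q_1,\ldots,q_n) \in \mathbb{R}^n \text{ with } q_m > -\theta, \quad E_\theta\left[\prod_{m=1}^n S_m^{q_m}\right] = \frac{\Gamma(n\theta)}{\Gamma\left(n\theta + \sum_{m=1}^n q_m\right)} \prod_{m=1}^n \frac{\Gamma(\theta + q_m)}{\Gamma(\theta)} .$$

We shall put $\mathbf{S}_n \sim D_n(\theta)$ if $\mathbf{S}_n$ is Dirichlet distributed with parameter $\theta$. In such case $S_m \stackrel{d}{=} S_1$ for any 
$m \in \{1, \ldots, n\}$, independently of $m$, and the individual fragment sizes are all identically distributed. Their 
common p.d.f. on the interval $(0, 1)$ is the beta distribution with parameters $(\theta, (n-1)\theta)$. As a result, parameter 
$\theta$ interprets as a "precision" parameter indicating how concentrated the distribution of $\mathbf{S}_n$ is around its mean 
$(1/n, \ldots, 1/n)$: the larger $\theta$ is, the more the distribution of $\mathbf{S}_n$ is concentrated around its mean. Indeed, one can 
check that, for any $m \in \{1, \ldots, n\}$, $E_\theta(S_m) = \frac{1}{n}$ and $\mathrm{var}(S_m) = \frac{n-1}{n^2(n\theta+1)}$. 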

In the random division of the interval as in equation (1), although all fragments are identically distributed with 
expected sizes of order $1/n$, the smallest fragment size grows like $n^{-(\theta+1)/\theta}$ while the size of the largest is of order 
$\frac{1}{n\theta}\log\left(n \log^{\theta-1} n\right)$. Consistently, the smaller $\theta$ is, the larger (resp. the smaller) the largest (resp. the smallest) 
fragment size is: hence, the smaller $\theta$ is, the more the values of the $S_m$ are disparate with high probability. Let 
$\mathbf{S}_{(n)} = (S_{(1)}, \ldots, S_{(n)})$ be the ordered version of $\mathbf{S}_n$ with $S_{(1)} > \cdots > S_{(n)}$. The smaller the parameter $\theta$ is, the 
more the size of the largest fragment $S_{(1)}$ tends to dominate the other ones. On the contrary, for large values of 
$\theta$, the fragment sizes look more homogeneous and the distribution in equation (1) concentrates on its centre $(1/n, \ldots, 1/n)$. 
For large $\theta$, the diversity of the partition is small. 

When $\theta = 1$, the partition corresponds to the standard uniform random partition model of the interval. When 
$\theta \uparrow \infty$, $\mathbf{S}_n$ approaches the deterministic partition of the interval into $n$ equal parts with sizes $1/n$. Although $\mathbf{S}_n$ 
has a degenerate weak limit when $n \uparrow \infty$, $\theta \downarrow 0$ while $n\theta = \gamma > 0$, this situation is worth being considered. 
Indeed, many interesting statistical features emerge from the fact that in such an asymptotic regime $\mathbf{S}_{(n)}$ converges 
to $\mathbf{S}_{(\infty)}$ having the Poisson-Dirichlet distribution $PD(\gamma)$ with parameter $\gamma$ (see Kingman, 1975). These three 
situations will be referred to later as the Bose-Einstein, Maxwell-Boltzmann and Kingman cases, respectively. 
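The role of $\theta$ as a precision parameter can be checked numerically. The following is a minimal sketch, assuming the standard construction of a symmetric Dirichlet vector by normalising iid gamma variables; the function names are illustrative and not part of the original text. It compares the empirical mean and variance of one fragment with $1/n$ and $(n-1)/(n^2(n\theta+1))$.

```python
import random


def sample_dirichlet_partition(n, theta, rng):
    """Draw S_n ~ D_n(theta) by normalising n iid gamma(theta) variables."""
    x = [rng.gammavariate(theta, 1.0) for _ in range(n)]
    total = sum(x)
    return [xi / total for xi in x]


def fragment_variance(n, theta):
    """Variance of a single fragment size: (n - 1) / (n^2 (n theta + 1))."""
    return (n - 1) / (n ** 2 * (n * theta + 1))


rng = random.Random(0)
n, theta = 5, 2.0
# Keep only the first fragment of each draw: all fragments share the same law.
draws = [sample_dirichlet_partition(n, theta, rng)[0] for _ in range(20000)]
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
print(mean, 1 / n)
print(var, fragment_variance(n, theta))
```

Increasing `theta` in this sketch shrinks the variance towards 0, which corresponds to the Maxwell-Boltzmann regime of nearly equal fragments.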

2.2 Sampling: motivations 

We shall be interested in sampling problems from the random partition $\mathbf{S}_n$, where $\mathbf{S}_n \sim D_n(\theta)$. Since $\mathbf{S}_n$ is random, 
sampling occurs in a random environment. Dirichlet distributions are ubiquitous in the natural sciences and this 
is why we chose this model for the random probabilities $\mathbf{S}_n$. We refer to Vlad et al. (2001) where it is shown 
that Dirichlet distributions may be seen as limit laws of certain "dilution" processes, and also that they maximize 
entropy under constraints, satisfy some scale-invariance property, etc. Due to its specific statistical properties as a 
random partition, many combinatorial issues arising in this sampling context can receive a proper and exact 
analytical answer. We shall illustrate this point. 

Sampling from $\mathbf{S}_n$ consists of a recursive $k$-throw of iid uniformly distributed random variables on the 
interval. It is said that fragment number $m$ is visited by some uniform throw if it hits the interval 
$[S_1 + \cdots + S_{m-1}, S_1 + \cdots + S_m]$ of length $S_m$. Before giving some technical details of the sampling problem, 
let us list some motivating concrete images of the sampling problem from $\mathbf{S}_n$: 

• $S_m$ could be the random abundance of species $m$ from a population with $n$ "animals". Some sampling 
process starts when a biologist records each new species met at each of his $k$ measurement campaigns. 

• $S_m$ could be the random size of district number $m$ of some city, with $m \in \{1, \ldots, n\}$ (e.g. with $n = 20$ for 
Paris). An unfriendly sampling process could be a scattered shot bombing with $k$ bombs. 

• $S_m$ could be the random popularity of book $m$ in a library with $n$ books. The sampling process is when $k$ 
consecutive readers borrow books from the library while respecting their popularities. 

• $S_m$ could be the random probability to be born on day $m$, with $n = 365$. A classroom with $k$ students is a 
$k$-sample from $\mathbf{S}_n$. 

We now give a non-exhaustive list of statistical problems of interest in this context: 

• Abundance estimation: given a sample size $k$, estimate the number $n$ of species and/or the parameter $\theta$, 
exploiting for example the information on the empirical number $p$ of distinct visited species or the 
knowledge of the empirical probability to visit the same species twice. 

• Match box problem: what is the state of fragment occupancies if the sequential sampling process is stopped 
when some fragment has received $c$ visits for the first time (if $n = 2$, this is the randomized Banach match 
box problem)? In particular, what is the probability that some cell is empty at this stopping time? 

• Birthday problem: what is the sample size until the first visit to two species of the same type? 

• Coupon collector problem: what is the sample size until all species have been visited at least once (or $r$ 
times)? 

• Law of succession: given the random number of occurrences of species $m$ in the $k$-sample and the number 
$p$ of distinct visited species, what is the probability to discover a new species in a $(k+1)$-sample? Or what 
is the probability that the $(k+1)$-th sample is one from the previously encountered species already met a 
certain number of times? 



We shall now be more precise and treat rigorously some of the raised problems, starting with the sampling 
problem before focusing on the occupancy distributions. In the next section, using these results, we shall come to 
the important problem of estimating $n$ when it is unknown. 

2.3 Sampling: preliminaries and generalities 

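Let $(U_1, \ldots, U_k)$ be $k$ iid uniform throws on $\mathbf{S}_n$. Let 

$$\mathbf{K}_{n,k} = \left(\mathcal{K}_{n,k}(1), \ldots, \mathcal{K}_{n,k}(n)\right)$$

be an integral-valued random vector which counts the number of visits to the different fragments in a $k$-sample. 
Hence, if $M_l$ is the random fragment number (or label) in which the $l$-th trial falls, $\mathcal{K}_{n,k}(m) = \sum_{l=1}^{k} I(M_l = m)$, 
$m \in \{1, \ldots, n\}$. Under our assumptions, for instance, we have that $P_\theta(M_l = m \mid \mathbf{S}_n) = S_m$ (the random 
probability to visit species $m$ is equal to its abundance) but also that the conditional probability to observe species 
number $m$ in a $k$-sample is 

$$\pi_k(S_m) = P_\theta\left(\mathcal{K}_{n,k}(m) > 0 \mid \mathbf{S}_n\right) = 1 - (1 - S_m)^k . \qquad (2)$$

Let us now focus our attention on the distribution of the occupancies $\mathbf{K}_{n,k}$. With $\sum_{m=1}^n k_m = k$ and $\mathbf{k}_n = 
(k_1, \ldots, k_n) \geq 0$, $\mathbf{K}_{n,k}$ follows the conditional multinomial distribution: 

$$P_\theta\left(\mathbf{K}_{n,k} = \mathbf{k}_n \mid \mathbf{S}_n\right) = \frac{k!}{\prod_{m=1}^n k_m!} \prod_{m=1}^n S_m^{k_m} .$$

Averaging over $\mathbf{S}_n$, using Dirichlet integrals, one finds 

$$P_\theta\left(\mathbf{K}_{n,k} = \mathbf{k}_n\right) = E_\theta P_\theta\left(\mathbf{K}_{n,k} = \mathbf{k}_n \mid \mathbf{S}_n\right) = \frac{\prod_{m=1}^n [\theta]_{k_m}}{[n\theta]_k} ,$$

where $[\theta]_k = (\theta)_k / k!$ and $(\theta)_k = \theta(\theta+1) \cdots (\theta+k-1)$, for any $k \geq 1$, with $(\theta)_0 = 1$. Applying Bayes formula, 
the posterior distribution of $\mathbf{S}_n$ given $\mathbf{K}_{n,k} = \mathbf{k}_n$ is determined by its p.d.f. at point $\mathbf{s}_n$ on the simplex as 

$$f_{\mathbf{S}_n}\left(\mathbf{s}_n \mid \mathbf{K}_{n,k} = \mathbf{k}_n\right) = \frac{\Gamma(n\theta + k)}{\prod_{m=1}^n \Gamma(\theta + k_m)} \prod_{m=1}^n s_m^{\theta + k_m - 1} \cdot \delta_{\left(\sum_{m=1}^n s_m - 1\right)} .$$

This shows, as is well known, that 

$$\mathbf{S}_n \mid \mathbf{K}_{n,k} = \mathbf{k}_n \sim D_n(\theta \mathbf{1} + \mathbf{k}_n),$$

an asymmetric Dirichlet distribution with parameters $\theta\mathbf{1} + \mathbf{k}_n = (\theta + k_1, \ldots, \theta + k_n)$. Furthermore, 

$$\forall m \in \{1, \ldots, n\}, \quad E_\theta\left(S_m \mid \mathbf{K}_{n,k} = \mathbf{k}_n\right) = \frac{\theta + k_m}{n\theta + k} .$$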
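As a numerical illustration of the averaged occupancy law and of the posterior mean, here is a minimal sketch in exact rational arithmetic. The helper names `rising`, `occupancy_pmf` and `posterior_mean` are illustrative choices, not notation from the text. The sketch evaluates $\prod_m [\theta]_{k_m} / [n\theta]_k$ and $(\theta + k_m)/(n\theta + k)$ and checks that the probabilities sum to one over all occupancy vectors.

```python
from fractions import Fraction
from itertools import product
from math import comb, factorial


def rising(x, k):
    """Rising factorial (x)_k = x (x+1) ... (x+k-1), with (x)_0 = 1."""
    out = Fraction(1)
    for j in range(k):
        out *= x + j
    return out


def occupancy_pmf(counts, theta):
    """P(K_{n,k} = counts) = prod_m [theta]_{k_m} / [n theta]_k."""
    n, k = len(counts), sum(counts)
    num = Fraction(1)
    for km in counts:
        num *= rising(theta, km) / factorial(km)
    return num / (rising(n * theta, k) / factorial(k))


def posterior_mean(counts, theta):
    """E(S_m | K_{n,k} = counts) = (theta + k_m) / (n theta + k)."""
    n, k = len(counts), sum(counts)
    return [(theta + km) / (n * theta + k) for km in counts]


n, k, theta = 3, 4, Fraction(1, 2)
# Enumerate every occupancy vector (k_1, ..., k_n) with k_1 + ... + k_n = k.
vectors = [c for c in product(range(k + 1), repeat=n) if sum(c) == k]
total = sum(occupancy_pmf(c, theta) for c in vectors)
print(total)
print(posterior_mean((2, 1, 1), theta))
```

With `theta = 1` every occupancy vector receives the same probability $1/\binom{n+k-1}{k}$, which is the Bose-Einstein statistics.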

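The posterior mean suggests a recursive approach to the sampling formula where successive samples are drawn from the 
corresponding iterative posterior distributions. More specifically, let $(M_1, \ldots, M_k) \in \{1, \ldots, n\}^k$ be the numbers of 
the successive fragments thus drawn. Then, 

$$P_\theta(M_1 = m_1) = E\left(P_\theta(M_1 = m_1 \mid \mathbf{S}_n)\right) = E_\theta\left(S_{m_1}\right) = \frac{\theta}{n\theta} = \frac{1}{n} ,$$

and 

$$P_\theta(M_2 = m_2 \mid M_1) = \frac{\theta + I(M_1 = m_2)}{n\theta + 1} ,$$

$$P_\theta(M_k = m_k \mid M_1, \ldots, M_{k-1}) = \frac{\theta + \sum_{l=1}^{k-1} I(M_l = m_k)}{n\theta + k - 1} .$$

Proceeding in this way, the joint distribution of $(M_1, \ldots, M_k)$ reads 

$$P_\theta(M_1 = m_1, \ldots, M_k = m_k) = \frac{\theta}{n\theta} \cdot \frac{\theta + I(m_1 = m_2)}{n\theta + 1} \cdots \frac{\theta + \sum_{l=1}^{k-1} I(m_l = m_k)}{n\theta + k - 1} = \frac{\prod_{m=1}^n (\theta)_{k_m}}{(n\theta)_k} ,$$

where $k_m = \sum_{l=1}^{k} I(m_l = m)$. This distribution being invariant under permutations of the entries, the sequence 
$(M_1, \ldots, M_k)$ is exchangeable. It is called a Pólya urn sequence. We now prove the following convergence result: 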

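Lemma 2.1 Almost surely and in distribution, the following convergence holds: 

$$\frac{\mathbf{K}_{n,k}}{k} \xrightarrow[k \to \infty]{a.s.} \mathbf{S}_n .$$

Proof Let us first prove the convergence in distribution. The joint conditional generating function of $\mathbf{K}_{n,k}$ reads 

$$E_\theta\left(\prod_{m=1}^n u_m^{\mathcal{K}_{n,k}(m)} \mid \mathbf{S}_n\right) = \left(\sum_{m=1}^n u_m S_m\right)^k ,$$

which is homogeneous with degree $k$, allowing to compute $E_\theta\left(\prod_{m=1}^n u_m^{\mathcal{K}_{n,k}(m)/k}\right)$. Further, defining $S_m = 
X_m / \sum_{m'=1}^n X_{m'}$, where $X_m \sim \mathrm{gamma}(\theta)$ are iid, for all $m \in \{1, \ldots, n\}$, using independence between $(S_1, \ldots, S_n)$ 
and $\sum_{m=1}^n X_m$, and recalling that $(S_1, \ldots, S_n)$ has the Dirichlet distribution $D_n(\theta)$, we get 

$$E_\theta\left[\prod_{m=1}^n u_m^{\mathcal{K}_{n,k}(m)/k}\right] = E_\theta\left[\left(\sum_{m=1}^n u_m^{1/k} S_m\right)^k\right] = \frac{\Gamma(n\theta)}{\Gamma(n\theta + k)} E_\theta\left[\left(\sum_{m=1}^n u_m^{1/k} X_m\right)^k\right]$$

$$\underset{k \to \infty}{\sim} \frac{\Gamma(n\theta)}{\Gamma(n\theta + k)} E_\theta\left[\left(\sum_{m=1}^n X_m\right)^k \left(1 + \frac{1}{k}\sum_{m=1}^n S_m \log u_m\right)^k\right] \xrightarrow[k \to \infty]{} E_\theta\left[\prod_{m=1}^n u_m^{S_m}\right] .$$

Thus, 

$$\frac{\mathbf{K}_{n,k}}{k} \xrightarrow[k \to \infty]{d} \mathbf{S}_n .$$

By applying the strong law of large numbers (conditionally given $\mathbf{S}_n$), the above convergence in distribution also 
holds almost surely. This shows that $\mathbf{K}_{n,k}/k$ can be used as a consistent estimator of $\mathbf{S}_n$. □ 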
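The sequential construction behind the Pólya urn sequence can be checked exactly on small cases. The sketch below is only an illustration, with hypothetical function names and fragment labels numbered from 0. It multiplies the one-step predictive probabilities $(\theta + \text{past visits})/(n\theta + \text{step})$ and compares the result with the closed form $\prod_m (\theta)_{k_m} / (n\theta)_k$, for a sequence and for all of its reorderings.

```python
from fractions import Fraction
from itertools import permutations


def rising(x, k):
    """Rising factorial (x)_k, with (x)_0 = 1."""
    out = Fraction(1)
    for j in range(k):
        out *= x + j
    return out


def sequential_probability(labels, n, theta):
    """P(M_1 = m_1, ..., M_k = m_k) built from the one-step predictive rule."""
    prob = Fraction(1)
    counts = [0] * n
    for step, m in enumerate(labels):
        # 'step' samples were drawn so far, 'counts[m]' of them from fragment m.
        prob *= (theta + counts[m]) / (n * theta + step)
        counts[m] += 1
    return prob


def closed_form_probability(labels, n, theta):
    """prod_m (theta)_{k_m} / (n theta)_k, with k_m the visits to fragment m."""
    num = Fraction(1)
    for m in range(n):
        num *= rising(theta, labels.count(m))
    return num / rising(n * theta, len(labels))


n, theta = 3, Fraction(3, 2)
seq = (0, 2, 0, 1, 0)
reference = closed_form_probability(seq, n, theta)
print(sequential_probability(seq, n, theta), reference)
# Exchangeability: every reordering of the sequence has the same probability.
print(all(sequential_probability(p, n, theta) == reference for p in permutations(seq)))
```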

2.4 Ewens sampling formulae for Dirichlet partitions 

The Ewens Sampling Formula (ESF) gives the distribution of alleles (different types of genes) in a sample of size $k$ 
from the Poisson-Dirichlet partitioning $PD(\gamma)$. Alternatively, it can be described in terms of sequential sampling 
of animals from a countable collection of distinguishable species drawn from $PD(\gamma)$. It provides the probability of 
the partition of a sample of, say, $k$ selectively equivalent genes into a number of alleles as the population size becomes 
indefinitely large. When the order of appearance of sequentially sampled species does not matter, we are led to 
the first ESF for unordered sequences. A second equivalent way to describe the sample is to record the number of 
species in the $k$-sample with exactly $i$ representatives, for $i \in \{0, \ldots, k\}$. When doing this while assuming the 
species have random frequencies following the $PD(\gamma)$ distribution, we are led to a second Ewens Sampling Formula. 

We recall here the exact expressions of both the first and second Ewens sampling formulae, when sampling is first 
from finite Dirichlet random partitions with $n$ fragments. Here, the order in which the consecutive animals are 
being discovered in the sampling process is irrelevant. In the sampling formulae, the joint event that there are $p$ 
distinct fragments visited will also be taken into account. These sampling formulae give both ESF formulae from 
$PD(\gamma)$ when passing to the Kingman limit. 

Let $\mathbf{S}_n$ be the above Dirichlet random partition with parameter $\theta > 0$. Let $k \geq 1$ and $(U_1, \ldots, U_k)$ be $k$ iid uniform 
random throws on $[0, 1]$. Let then $(M_1, \ldots, M_k)$ be the (conditionally iid) corresponding animal species, with 
common conditional and unconditional distributions: 

$$\forall m \in \{1, \ldots, n\}, \quad P_\theta(M = m \mid \mathbf{S}_n) = S_m ,$$

and 

$$\forall m \in \{1, \ldots, n\}, \quad P_\theta(M = m) = E\left[P_\theta(M = m \mid \mathbf{S}_n)\right] = E_\theta(S_m) = \frac{1}{n} .$$

Recall that $\mathcal{K}_{n,k}(m) = \sum_{l=1}^{k} I(M_l = m)$ counts the random number of occurrences of species $m$ in the $k$-sample 
and let $P_{n,k} = \sum_{m=1}^{n} I\left(\mathcal{K}_{n,k}(m) > 0\right)$ count the number of distinct species which have been visited in the 
$k$-sampling process. 

There are two occupancies variables of interest: the first one will lead to the first Ewens sampling formula while 
the second corresponds to the second Ewens sampling formula. 

I. For any $q \in \{1, \ldots, p\}$, $\mathcal{B}_{n,k}(q) > 0$ is the number of animals of species $q$, where the $P_{n,k} = p$ species 
observed were labelled in an arbitrary way (independently of the sampling mechanism). Thus $\mathcal{B}_{n,k}$ differs 
from $\mathcal{K}_{n,k}$ in the sense that all the components of $\mathcal{B}_{n,k}$ are positive. 

II. For any $i \in \{0, \ldots, k\}$, $\mathcal{A}_{n,k}(i)$ is the number of species in the $k$-sample with $i$ representatives, i.e. 

$$\mathcal{A}_{n,k}(i) = \#\{m \in \{1, \ldots, n\} : \mathcal{K}_{n,k}(m) = i\} = \sum_{m=1}^{n} I\left(\mathcal{K}_{n,k}(m) = i\right) .$$

Then $\sum_{i=0}^{k} \mathcal{A}_{n,k}(i) = n$ is the (unknown) number of fragments, $\sum_{i=1}^{k} \mathcal{A}_{n,k}(i) = p$ is the number of 
fragments visited by the $k$-sample and $\mathcal{A}_{n,k}(0)$ is the number of unvisited ones. Note that $\sum_{i=1}^{k} i \mathcal{A}_{n,k}(i) = k$ 
is the sample size. The random vector $(\mathcal{A}_{n,k}(1), \ldots, \mathcal{A}_{n,k}(k))$ is called the fragment vector count or the 
species vector count in biology, see Ewens (1990). 

For each of the two sampling problems, we easily obtain the Ewens sampling formulae from finite partitions $\mathbf{S}_n$ 
drawn from the Dirichlet distribution. The following result can be found in Huillet (2005) (see also Ewens (1972) for 
the PD case). 

Theorem 2.1 

I For any $(b_1, \ldots, b_p)$ such that $\forall q \in \{1, \ldots, p\}$, $b_q \geq 1$ and $\sum_{q=1}^{p} b_q = k$, we have 

$$P_\theta\left(\mathcal{B}_{n,k}(1) = b_1, \ldots, \mathcal{B}_{n,k}(p) = b_p; P_{n,k} = p\right) = \binom{n}{p} \frac{k!}{(n\theta)_k} \prod_{q=1}^{p} \frac{(\theta)_{b_q}}{b_q!} . \qquad (3)$$

II For any $(a_1, \ldots, a_k) \geq 0$ such that $\sum_{i=1}^{k} i a_i = k$ and $\sum_{i=1}^{k} a_i = p$, 

$$P_\theta\left(\mathcal{A}_{n,k}(1) = a_1, \ldots, \mathcal{A}_{n,k}(k) = a_k; P_{n,k} = p\right) = \frac{n!}{(n-p)!} \cdot \frac{k!}{\prod_{i=1}^{k} (i!)^{a_i} a_i!} \cdot \frac{1}{(n\theta)_k} \prod_{i=1}^{k} (\theta)_i^{a_i} . \qquad (4)$$

From equation (3) or equivalently from equation (4), one can obtain the marginal distribution of $P_{n,k}$. 



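Theorem 2.2 For any $p \geq 1$, 

$$P_\theta(P_{n,k} = p) = \binom{n}{p} \frac{p!}{(n\theta)_k} B_{k,p}(\theta) , \qquad (5)$$

where 

$$B_{k,p}(\theta) = \frac{k!}{p!} \sum_{\substack{b_1, \ldots, b_p \geq 1 \\ b_1 + \cdots + b_p = k}} \prod_{q=1}^{p} \frac{(\theta)_{b_q}}{b_q!} .$$

We recall below a straightforward representation of the probability $P_\theta(P_{n,k} = p)$ in the form of an alternating 
sum (see for example Keener et al., pages 1471-1472). 

Proposition 2.1 For any $m \in \{0, \ldots, n-1\}$, let $(\theta)_{n,k;m} = \frac{((n-m)\theta)_k}{(n\theta)_k}$. The distribution of $P_{n,k}$ is given by 

$$P_\theta(P_{n,k} = p) = \binom{n}{p} \sum_{q=1}^{p} (-1)^{p-q} \binom{p}{q} (\theta)_{n,k;n-q} . \qquad (6)$$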

Let us now focus on two problems related to sampling as explained previously. 

The law of succession We would like to briefly recall a related question raised in Donnelly (1986) and Ewens 
(1996), concerning the law of succession. 

1. Let "$M_{k+1}$ is new" denote the event that $M_{k+1}$ is none of the previously observed species. One can 
prove that 

$$P_\theta\left(M_{k+1} \text{ is new} \mid \mathcal{B}_{n,k}(1) = b_1, \ldots, \mathcal{B}_{n,k}(p) = b_p; P_{n,k} = p\right) = \frac{(n-p)\theta}{n\theta + k} , \qquad (7)$$

which is independent of the cell occupancies $b_1, \ldots, b_p$ but depends on the number $p$ of distinct species already 
visited by the $k$-sample. With $k = p = 1$, this gives the probability $\frac{(n-1)\theta}{n\theta+1}$ that the first two random throws 
visit two distinct species. The complementary probability that they do not is thus $1 - \frac{(n-1)\theta}{n\theta+1} = \frac{\theta+1}{n\theta+1}$. This 
probability to visit any fragment twice varies between $1$ and $\frac{1}{n}$ when $\theta$ varies from $0$ (the largest fragment 
dominates) to infinity (the fragment sizes approach $\frac{1}{n}$). 

2. Similarly, let the event "$M_{k+1}$ is a species seen $b_r$ times" denote the fact that the $(k+1)$-th sample is one 
from the previously encountered fragment already visited $b_r$ times. We easily get 

$$P_\theta\left(M_{k+1} \text{ is a species seen } b_r \text{ times} \mid \mathcal{B}_{n,k}(1) = b_1, \ldots, \mathcal{B}_{n,k}(p) = b_p; P_{n,k} = p\right) = \frac{\theta + b_r}{n\theta + k} , \qquad (8)$$

which depends on the cell occupancies only through $b_r$ and is also independent of the number $p$ of distinct species. 

The number of distinct observations From equations (7) and (8), we also have the transition probabilities 

$$P_\theta\left(P_{n,k+1} = p+1 \mid P_{n,k} = p\right) = \frac{(n-p)\theta}{n\theta + k}$$

and 

$$P_\theta\left(P_{n,k+1} = p \mid P_{n,k} = p\right) = \frac{p\theta + k}{n\theta + k} .$$

It follows that we have the following recursion for the distribution of $P_{n,k}$: 

$$P_\theta(P_{n,k+1} = p) = \frac{(n-p+1)\theta}{n\theta + k} P_\theta(P_{n,k} = p-1) + \frac{p\theta + k}{n\theta + k} P_\theta(P_{n,k} = p) .$$

Using equation (5), we obtain the following triangular recurrence for the quantities $B_{k,p}(\theta)$: 

$$B_{k+1,p}(\theta) = \theta B_{k,p-1}(\theta) + (p\theta + k) B_{k,p}(\theta) .$$

These should be considered with boundary conditions 

$$B_{k,0}(\theta) = B_{0,p}(\theta) = 0,$$

except for $B_{0,0}(\theta) = 1$. Under this form, $B_{k,p}(\theta)$ turns out to be the Bell polynomial in the variables $x_1 = 
(\theta)_1, x_2 = (\theta)_2, \ldots, x_k = (\theta)_k$. This leads in particular to $B_{k,1}(\theta) = (\theta)_k$, $k \geq 1$, and to $B_{k,k}(\theta) = \theta^k$. 
Special cases Let us now study the three special cases mentioned in the introduction of this section. 
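1. Bose-Einstein case. When $\theta = 1$, equation (3) simplifies to 

$$P_1\left(\mathcal{B}_{n,k}(1) = b_1, \ldots, \mathcal{B}_{n,k}(p) = b_p; P_{n,k} = p\right) = \frac{\binom{n}{p}}{\binom{n+k-1}{k}} ,$$

which is independent of the cell occupancies $(b_1, \ldots, b_p)$ (i.e. the probability is uniform). As there are 
$\binom{k-1}{p-1}$ sequences $b_q \geq 1$ for all $q \in \{1, \ldots, p\}$ satisfying $\sum_{q=1}^{p} b_q = k$, we get $B_{k,p}(1) = \frac{k!}{p!}\binom{k-1}{p-1}$ (called Lah 
numbers) and 

$$\forall p \in \{1, \ldots, n \wedge k\}, \quad P_1(P_{n,k} = p) = \frac{\binom{n}{p}\binom{k-1}{p-1}}{\binom{n+k-1}{k}} .$$

Equation (4) reduces to 

$$P_1\left(\mathcal{A}_{n,k}(1) = a_1, \ldots, \mathcal{A}_{n,k}(k) = a_k; P_{n,k} = p\right) = \frac{n!/(n-p)!}{\binom{n+k-1}{k} \prod_{i=1}^{k} a_i!} .$$

2. Maxwell-Boltzmann case. As $\theta \uparrow \infty$, the probability displayed in equation (3) converges to 

$$P_\infty\left(\mathcal{B}_{n,k}(1) = b_1, \ldots, \mathcal{B}_{n,k}(p) = b_p; P_{n,k} = p\right) = \binom{n}{p} \frac{k!}{n^k} \prod_{q=1}^{p} \frac{1}{b_q!} .$$

With $S_{k,p}$ the second kind Stirling numbers, we get 

$$\forall p \in \{1, \ldots, k\}, \quad P_\infty(P_{n,k} = p) = \binom{n}{p} \frac{p!\, S_{k,p}}{n^k} . \qquad (9)$$

This result is ancient and well-known (see Johnson and Kotz, 1969). Equation (4) converges to 

$$P_\infty\left(\mathcal{A}_{n,k}(1) = a_1, \ldots, \mathcal{A}_{n,k}(k) = a_k; P_{n,k} = p\right) = \frac{n!}{(n-p)!\, n^k} \cdot \frac{k!}{\prod_{i=1}^{k} (i!)^{a_i} a_i!} .$$

3. Kingman case. Consider the situation where $n \uparrow \infty$, $\theta \downarrow 0$ while $n\theta = \gamma > 0$. In such case, the probability 
displayed in equation (3) converges to 

$$P^*_\gamma\left(\mathcal{B}_k(1) = b_1, \ldots, \mathcal{B}_k(p) = b_p; P_k = p\right) = \frac{\gamma^p}{(\gamma)_k} \cdot \frac{k!}{p! \prod_{q=1}^{p} b_q} .$$

With $s_{k,p}$ the absolute value of the first kind Stirling numbers, we get 

$$\forall p \in \{1, \ldots, k\}, \quad P^*_\gamma(P_k = p) = \frac{\gamma^p s_{k,p}}{(\gamma)_k} . \qquad (10)$$

It follows that the probabilities displayed in equations (7) and (8) converge respectively to 

$$\frac{\gamma}{\gamma + k} \quad \text{and} \quad \frac{b_r}{\gamma + k} . \qquad (11)$$

We note also that the distribution of $P_k$ in this case is in the class of exponential families. We recall the 
important result of Korwar and Hollander (1973): 

$$\frac{P_k}{\log k} \xrightarrow[k \to \infty]{a.s.} \gamma .$$

At last, in the Kingman limit, the probability displayed in (4) converges to 

$$P^*_\gamma\left(\mathcal{A}_k(1) = a_1, \ldots, \mathcal{A}_k(k) = a_k; P_k = p\right) = \frac{k!}{(\gamma)_k} \prod_{i=1}^{k} \frac{\gamma^{a_i}}{i^{a_i} a_i!} .$$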
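The closed forms for the law of the number of distinct species in the Bose-Einstein and Kingman cases are easy to tabulate. The following sketch is an illustration only (the function names are not from the text). It builds the unsigned Stirling numbers of the first kind from their usual recurrence $s_{k+1,p} = s_{k,p-1} + k\, s_{k,p}$, and checks that both laws sum to one and that the Kingman mean equals $\sum_{l=0}^{k-1} \gamma/(\gamma+l)$.

```python
from fractions import Fraction
from math import comb


def stirling_first_unsigned(k):
    """Table s[j][p] of unsigned Stirling numbers of the first kind, j, p <= k."""
    s = [[0] * (k + 1) for _ in range(k + 1)]
    s[0][0] = 1
    for j in range(k):
        for p in range(1, j + 2):
            s[j + 1][p] = s[j][p - 1] + j * s[j][p]
    return s


def kingman_pmf(k, gamma):
    """P*(P_k = p) = gamma^p s_{k,p} / (gamma)_k for p = 1, ..., k."""
    s = stirling_first_unsigned(k)
    norm = Fraction(1)
    for j in range(k):
        norm *= gamma + j
    return {p: Fraction(gamma) ** p * s[k][p] / norm for p in range(1, k + 1)}


def bose_einstein_pmf(n, k):
    """P_1(P_{n,k} = p) = C(n,p) C(k-1,p-1) / C(n+k-1,k) for p <= min(n, k)."""
    return {p: Fraction(comb(n, p) * comb(k - 1, p - 1), comb(n + k - 1, k))
            for p in range(1, min(n, k) + 1)}


gamma, k = Fraction(3, 2), 7
king = kingman_pmf(k, gamma)
mean_species = sum(p * prob for p, prob in king.items())
print(sum(king.values()), mean_species)
print(sum(bose_einstein_pmf(5, 8).values()))
```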






Thierry Huillet and Christian Paroissin 



3 Estimation of the number of species 



In this section we investigate several statistical aspects dealing with the estimation of the number of species. 
We shall start by considering the problem of estimating the number of species, assuming first $\theta$ to be known. The 
proposed procedure to estimate $(\theta, n)$ is explained afterwards. Then we consider stopping rules for the sampling 
process and a goodness-of-fit procedure. To conclude, numerical simulations were carried out. 

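3.1 Estimation of $n$ when $\theta$ is known 

Using Theorems 2.1 and 2.2, one can easily derive the conditional distributions of $(\mathcal{B}_{n,k}(1), \ldots, \mathcal{B}_{n,k}(p))$ and of 
$(\mathcal{A}_{n,k}(1), \ldots, \mathcal{A}_{n,k}(k))$ which are respectively: 

$$P_\theta\left(\mathcal{B}_{n,k}(1) = b_1, \ldots, \mathcal{B}_{n,k}(p) = b_p \mid P_{n,k} = p\right) = \frac{k!}{p!\, B_{k,p}(\theta)} \prod_{q=1}^{p} \frac{(\theta)_{b_q}}{b_q!} \qquad (12)$$

and: 

$$P_\theta\left(\mathcal{A}_{n,k}(1) = a_1, \ldots, \mathcal{A}_{n,k}(k) = a_k \mid P_{n,k} = p\right) = \frac{k!}{B_{k,p}(\theta)} \prod_{i=1}^{k} \frac{1}{a_i!} \left(\frac{(\theta)_i}{i!}\right)^{a_i} . \qquad (13)$$

These conditional probabilities being independent of $n$, it follows that the random variable $P_{n,k}$ is a sufficient 
statistic for $n$. 

Assume now that $k > P$. Using log-concavity in $n$ of $P_\theta(P_{n,k} = P)$, the maximum likelihood estimator $\hat{n}$ is given 
implicitly by: 

$$\frac{P_\theta(P_{n,k} = P)}{P_\theta(P_{n-1,k} = P)} = 1 .$$

From equation (5), identifying $\hat{n}$ with the largest integer short of the solution, the estimator $\hat{n}$ we suggest is the 
fixed point of: 

$$n = P + n \frac{((n-1)\theta)_k}{(n\theta)_k} . \qquad (14)$$

This estimator is biased from above. The estimator $\tilde{n}$ of Keener et al. (1986) is given by: 

$$\tilde{n} = P + \frac{B_{k,P-1}(\theta)}{B_{k,P}(\theta)} . \qquad (15)$$

If $k \geq n$, it is unbiased, attaining the minimum variance bound (UMVB) and in this case we have: 

In practice, it is interesting to plot the observed number of species $P$ against sample size $k$. If $n < \infty$, $P$ should 
stabilize to an asymptote. If this is not the case, $P$ should drift to $\infty$ with $k$. For example, consider the following 
situation where $P/k \to p \in (0, 1)$ when $k \to \infty$ and $P \to \infty$. Using an asymptotic representation of $B_{k,p}(\theta)$ in 
this limit, one gets that $\hat{n}/k \to p^*$ where $p^* > 0$ is defined implicitly (see Keener et al. 1986) by: 

$$\frac{p}{p^*} = 1 - \left(\frac{\theta p^*}{1 + \theta p^*}\right)^{\theta} .$$

Asymptotic normality of $(\hat{n} - k p^*)/\sqrt{k}$ could be proved as $k \to \infty$. 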
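To make the two estimators of $n$ concrete, here is a minimal numerical sketch. It assumes that the maximum likelihood estimator is obtained from the fixed-point relation $n = P + n\,((n-1)\theta)_k/(n\theta)_k$ (solved here over real $n$ by bisection, before any rounding to an integer) and that the estimator of Keener et al. takes the form $P + B_{k,P-1}(\theta)/B_{k,P}(\theta)$. The function names and the bisection scheme are illustrative choices. For $\theta = 1$ the outputs can be compared with the explicit values $P(k-1)/(k-P)$ and $Pk/(k-P+1)$.

```python
from fractions import Fraction


def bell_row(k, theta):
    """Row (B_{k,0}, ..., B_{k,k}) from B_{j+1,p} = theta B_{j,p-1} + (p theta + j) B_{j,p}."""
    row = [Fraction(1)] + [Fraction(0)] * k
    for j in range(k):
        new = [Fraction(0)] * (k + 1)
        for p in range(1, j + 2):
            new[p] = theta * row[p - 1] + (p * theta + j) * row[p]
        row = new
    return row


def umvb_estimate(P, k, theta):
    """Keener-type estimate P + B_{k,P-1}(theta) / B_{k,P}(theta), for 1 <= P <= k."""
    row = bell_row(k, theta)
    return P + row[P - 1] / row[P]


def expected_distinct(n, k, theta):
    """n (1 - ((n-1) theta)_k / (n theta)_k) for real n >= 1, in floating point."""
    ratio = 1.0
    for j in range(k):
        ratio *= ((n - 1) * theta + j) / (n * theta + j)
    return n * (1.0 - ratio)


def mle_estimate(P, k, theta, iterations=200):
    """Real root of n = P + n ((n-1) theta)_k / (n theta)_k, assuming 1 <= P < k."""
    if not 1 <= P < k:
        raise ValueError("a finite solution requires 1 <= P < k")
    lo, hi = float(P), float(P) + 1.0
    while expected_distinct(hi, k, theta) < P:  # enlarge the bracket if needed
        hi *= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if expected_distinct(mid, k, theta) < P:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


P, k = 3, 10
print(mle_estimate(P, k, 1.0), P * (k - 1) / (k - P))
print(umvb_estimate(P, k, Fraction(1)), Fraction(P * k, k - P + 1))
```

The fixed-point relation is equivalent to matching the observed $P$ with the expected number of distinct species $n\,(1 - ((n-1)\theta)_k/(n\theta)_k)$, which is why the sketch brackets the root of that function.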

Special cases 

1. Bose-Einstein case ($\theta = 1$). We find explicitly: 

$$\hat{n} = \frac{P(k-1)}{k-P} \quad \text{and} \quad \tilde{n} = \frac{Pk}{k-P+1} .$$

The maximal value which $\hat{n}$ can take is obtained if $k - P = 1$; in this case $\hat{n} = P^2$. Its minimal value is 
$1$ if $P = 1$ for all $k$. Note that in the Bose-Einstein model, $p^* = p/(1-p)$ and both $\hat{n}/k$ and $\tilde{n}/k$ would 
converge to $p^*$ assuming the asymptotic regime $P/k \to p$. 

2. Maxwell-Boltzmann case ($\theta \to \infty$). The maximum likelihood $\hat{n}$ solves: 

$$\hat{n} = P + \hat{n}\left(1 - \frac{1}{\hat{n}}\right)^k$$

and, with $S_{k,p}$ the second kind Stirling numbers, the UMVB estimator $\tilde{n}$ of Keener et al. becomes: 

$$\tilde{n} = P + \frac{S_{k,P-1}}{S_{k,P}} ,$$

recalling $B_{k,p}(\theta) \sim \theta^k S_{k,p}$ as $\theta \uparrow \infty$. 

3. Kingman case. Indeed there is no problem of estimating $n$ (because $n = \infty$); rather the problem is to 
estimate $\gamma > 0$, which is the unique remaining parameter. A situation in which the Kingman model fits the data 
best is a situation for which one should conclude $n = \infty$. Recalling equation (10), the MLE $\hat{\gamma}$ of $\gamma$ is 
characterized by $\partial_\gamma \log P^*_\gamma(P_k = P) = 0$, hence implicitly by: 

$$P = \sum_{l=0}^{k-1} \frac{\hat{\gamma}}{\hat{\gamma} + l} .$$

It is biased and involves the problem of inverting the generalized harmonic sequence. The properties of 
this estimator are well studied (see Carlton (1999) for a review). In particular, 

$$\hat{\gamma} \xrightarrow[k \to \infty]{\Pr} \gamma .$$

In sharp contrast with the finite $n$ case, there is no UMVB estimator $\tilde{\gamma}$ of $\gamma$ itself (nor of any polynomial in 
$\gamma$), because if $\tilde{\gamma} = \phi(P)$ existed and were unbiased, the function $\phi$ would satisfy: 

$$\sum_{p=1}^{k} \gamma^p s_{k,p}\, \phi(p) = \gamma\, (\gamma)_k ,$$

which is impossible because the left-hand side is a polynomial of degree at most $k$ in $\gamma$ whereas the right-hand 
side is a polynomial of degree $k+1$. So, if the problem is to estimate $\gamma$, $\hat{\gamma}$ turns out to be the more 
satisfactory estimate in this case, despite its bias. However there are UMVB estimators of rational 
functions of $\gamma$ of the form: 

$$\forall l \in \{1, \ldots, k\}, \quad \Gamma_l(\gamma) = \frac{\gamma^l (\gamma)_{k-l}}{(\gamma)_k} .$$

They are given by: 

$$\tilde{\Gamma}_l = \frac{s_{k-l,P-l}}{s_{k,P}} . \qquad (16)$$

Indeed, from equation (10), recalling $s_{k,p} = 0$ if $p > k$: 

$$E^*_\gamma\left(\frac{s_{k-l,P_k-l}}{s_{k,P_k}}\right) = \frac{\gamma^l}{(\gamma)_k} \sum_{p=l}^{k} \gamma^{p-l} s_{k-l,p-l} = \Gamma_l(\gamma) .$$

In particular, when $l = 1$, $\tilde{\Gamma}_1 = \frac{s_{k-1,P-1}}{s_{k,P}}$ is an UMVB estimator of $\Gamma_1(\gamma) = \frac{\gamma (\gamma)_{k-1}}{(\gamma)_k} = \frac{\gamma}{\gamma+k-1}$ which, from 
equation (11), is the probability to observe a new species from the $k$-th trial (see Ewens, 1996). 
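In the Kingman case the likelihood equation has no closed-form solution, so it has to be inverted numerically. The sketch below is a hedged illustration: it assumes the likelihood equation $P = \sum_{l=0}^{k-1} \gamma/(\gamma+l)$, which follows from differentiating $\log(\gamma^P s_{k,P}/(\gamma)_k)$ with respect to $\gamma$, and solves it by bisection. The function names, the bracket and the number of iterations are arbitrary choices.

```python
def expected_species(k, gamma):
    """Right-hand side of the likelihood equation: sum_{l<k} gamma / (gamma + l)."""
    return sum(gamma / (gamma + l) for l in range(k))


def kingman_mle(P, k, iterations=200):
    """Solve P = sum_{l<k} gamma/(gamma+l) for gamma by bisection, for 1 < P < k."""
    if not 1 < P < k:
        raise ValueError("a finite positive solution requires 1 < P < k")
    lo, hi = 1e-12, 1.0
    while expected_species(k, hi) < P:  # the map is increasing in gamma
        hi *= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if expected_species(k, mid) < P:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Round trip: feed the expected number of species for gamma = 2.5 back in.
k = 50
target = expected_species(k, 2.5)
print(target, kingman_mle(target, k))
```

The boundary cases $P = 1$ and $P = k$ are excluded because they correspond to $\gamma \downarrow 0$ and $\gamma \uparrow \infty$ respectively.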

3.2 Joint estimations of $\theta$ and $n$ 

In some applications, $\theta$ is also unknown and the question of its simultaneous estimation arises. As $P_{n,k}$ is not a 
sufficient statistic for $\theta$ (from equation (12), for example), we turn to a different point of view. We briefly recall the 
idea of an estimator studied in Huillet and Paroissin (2005). With $(U_1, \ldots, U_k)$ the $k$ iid uniform random sample on 
$[0, 1]$, let $(M_1, \ldots, M_k)$ be the corresponding fragment numbers hit (or animal species). With $l_1, l_2 \in \{1, \ldots, k\}$, 
let: 

$$\delta_{l_1,l_2} = \sum_{m=1}^{n} I\left(M_{l_1} = m; M_{l_2} = m\right)$$

denote the indicator of the event that $M_{l_1} = M_{l_2}$ for some fragment in $\mathbf{S}_n$. Introduce the pair-matching statistic: 

$$D_{n,k} = \frac{1}{k(k-1)} \sum_{l_1 \neq l_2} \delta_{l_1,l_2} .$$

It is the empirical probability that two randomly chosen items of the $k$-sample are identical. In a genetic context, 
$D_{n,k}$ is called the homozygosity of the sample (Tavaré, 2004). Note that: 

$$D_{n,k} = \frac{1}{k(k-1)} \sum_{q=1}^{P_{n,k}} \mathcal{B}_{n,k}(q)\left(\mathcal{B}_{n,k}(q) - 1\right) .$$

Indeed for each visited species $q$, we need to count the number $\mathcal{B}_{n,k}(q) - 1$ of returns to $q$, together with its 
multiplicity, with $\sum_{q=1}^{P_{n,k}} \mathcal{B}_{n,k}(q) = k$. Note that $D_{n,k}$ is a function of $P_{n,k}$ and of $\mathcal{B}_{n,k}(q)$, $q \in \{1, \ldots, P_{n,k}\}$. 

The expectation of $\delta_{l_1,l_2}$ is the probability that two fragments chosen at random are the same. From equation (4) 
with $k = 2$ and $p = 1$, $a_1 = 0$, $a_2 = 1$, we get: 

$$E_\theta(D_{n,k}) = E_\theta(\delta_{l_1,l_2}) = \frac{1 + \theta}{1 + n\theta} . \qquad (17)$$

Assume the observations are $P_{n,k} = P$ and $\mathcal{B}_{n,k}(q) = B_q$ for $q \in \{1, \ldots, P\}$. Then the observed value $D$ of 
$D_{n,k}$ is: 

$$D = \frac{1}{k(k-1)} \left(\sum_{q=1}^{P} B_q^2 - k\right) .$$

Applying the method of moments, $\theta$ can be estimated by $(1 - D)/(nD - 1)$ which is a consistent estimator. 
Therefore, we propose the following estimators of the pair $(\theta, n)$: 

$$\hat{\theta}_1 = \frac{1 - D}{\hat{n}_1 D - 1} \quad \text{and} \quad \hat{n}_1 = P + \hat{n}_1 \frac{((\hat{n}_1 - 1)\hat{\theta}_1)_k}{(\hat{n}_1 \hat{\theta}_1)_k} , \qquad (18)$$

or: 

$$\tilde{\theta}_1 = \frac{1 - D}{\tilde{n}_1 D - 1} \quad \text{and} \quad \tilde{n}_1 = P + \frac{B_{k,P-1}(\tilde{\theta}_1)}{B_{k,P}(\tilde{\theta}_1)} . \qquad (19)$$

These estimators are based on the couple of observations $(P, D)$ and depend on which estimator of $n$ itself was 
chosen. The numerical strategy is to get an implicit equation for $\hat{n}_1$ (or $\tilde{n}_1$) by substituting the expression of $\hat{\theta}_1$ 
(or $\tilde{\theta}_1$) as a function of $\hat{n}_1$ (or $\tilde{n}_1$) in the second equation, solve it in $\hat{n}_1$ (or $\tilde{n}_1$) as a fixed point problem and then 
deduce the corresponding estimates for $\theta$. Note that the functions involved in this fixed point problem are rational. 
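The first half of this strategy, computing the pair-matching statistic from the observed occupancies and turning it into a moment estimate of $\theta$ for a candidate value of $n$, can be sketched as follows. This is an illustration only: the function names are hypothetical, and the fixed-point search over $n$ is deliberately left out.

```python
from fractions import Fraction


def pair_matching(counts):
    """Observed D = sum_q B_q (B_q - 1) / (k (k - 1)), with k = sum_q B_q >= 2."""
    k = sum(counts)
    return Fraction(sum(b * (b - 1) for b in counts), k * (k - 1))


def theta_from_moments(D, n):
    """Moment estimate (1 - D) / (n D - 1) of theta for a given n; needs n D > 1."""
    if n * D <= 1:
        raise ValueError("the moment equation has no positive solution when n D <= 1")
    return (1 - D) / (n * D - 1)


counts = [3, 2, 1]          # hypothetical occupancies B_1, B_2, B_3 (k = 6, P = 3)
D = pair_matching(counts)
print(D)                    # 4/15
print(theta_from_moments(D, 5))
```

The guard on `n * D` reflects the fact that $(1+\theta)/(1+n\theta)$ always exceeds $1/n$, so an observed $D$ at or below $1/n$ is incompatible with any positive $\theta$ for that value of $n$.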

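An alternative estimation procedure which uses the observable $P$ and the cell occupancies $\mathcal{B}_{n,k}(q) = B_q$, $q \in 
\{1, \ldots, P\}$, is as follows. Consider the Rényi entropy of order $\alpha \neq 1$ (Pielou, 1975) defined as follows: 

$$\phi_n = \frac{1}{1 - \alpha} \log\left(\sum_{m=1}^{n} S_m^{\alpha}\right) .$$

When $\alpha = 2$, $\phi_n = -\log\left(\sum_{m=1}^{n} S_m^2\right)$ is the Simpson index of biodiversity (Simpson, 1949) up to the logarithmic 
transformation. As $\alpha$ tends to 1, $\phi_n$ tends to $-\sum_{m=1}^{n} S_m \log S_m$, the Shannon entropy, which is also an index of 
biodiversity (Pielou, 1975). Consequently we will rather consider random additive functionals of $\mathbf{S}_n$: 

$$\phi_n = \sum_{m=1}^{n} h(S_m) .$$

Hence with $h(s) = s^2$, it is exactly the Simpson index and with $h(s) = -s \log s$, it is the Shannon index. Below 
we will only consider the former case. The Simpson index of biodiversity can also be viewed as the size of a 
size-biased sample fragment from $\mathbf{S}_n$ for which: 

$$E_\theta\left(\sum_{m=1}^{n} S_m^2\right) = \frac{\theta + 1}{n\theta + 1} .$$

We obtain the same expected value as in equation (17): it was already noticed by Simpson (1949). For a review of 
various measures of species diversity, see Hubalek (2000). According to Lemma 2.1, $\mathcal{K}_{n,k}(m)/k$ is an estimator of 
$S_m$, implying that the quantity: 

$$\hat{\phi}_{n,k} = \sum_{m=1}^{n} h\left(\frac{\mathcal{K}_{n,k}(m)}{k}\right)$$

could be used as an estimator of $\phi_n$. Clearly $\hat{\phi}_{n,k} \to \phi_n$ almost surely as $k \to \infty$ and in particular: 

$$E_\theta\left(\hat{\phi}_{n,k}\right) \xrightarrow[k \to \infty]{} E_\theta(\phi_n) .$$

To be effective this supposes $n$ to be known, which might not be the case. If $n$ is unknown, enhancing fragments with 
small probability to occur, we shall rather consider a quantity based on the sample coverage $C$ (the proportion of 
seen species in a $k$-sample): 

$$C = \sum_{q=1}^{P_{n,k}} S_q .$$

Indeed following Chao and Shen (2003), we can consider: 

$$\psi_{n,k} = \sum_{q=1}^{P_{n,k}} \frac{h(S'_q)}{\pi_k(S'_q)} ,$$

where $S'_q = S_q / C$ for all $q \in \{1, \ldots, P_{n,k}\}$ (in order to have $\sum_{q=1}^{P_{n,k}} S'_q = 1$) and where $\pi_k(S'_q) = 1 - (1 - S'_q)^k$ is 
the probability to observe fragment $q$ among the $P_{n,k}$ which were effectively observed (see equation (2)). Clearly, 
we have: 

$$\psi_{n,k} \xrightarrow[k \to \infty]{} \phi_n .$$

But $\psi_{n,k}$ involves unknown quantities. Hence, if $P_{n,k}$ fragments are observed, an estimator of $S'_q = S_q / C$ is: 

$$\hat{S}'_q = \frac{\mathcal{B}_{n,k}(q)}{k} .$$

It follows that a possible estimator of $\psi_{n,k}$ is: 

$$\hat{\psi}_{n,k} = \sum_{q=1}^{P_{n,k}} \frac{h\left(\mathcal{B}_{n,k}(q)/k\right)}{1 - \left(1 - \mathcal{B}_{n,k}(q)/k\right)^k} .$$

Particularizing to $h(s) = s^2$ (Simpson index of diversity), 

$$\hat{\psi}_{n,k} = \sum_{q=1}^{P_{n,k}} \frac{\left(\mathcal{B}_{n,k}(q)/k\right)^2}{1 - \left(1 - \mathcal{B}_{n,k}(q)/k\right)^k}$$

is such that: 

$$E_\theta\left(\hat{\psi}_{n,k}\right) \xrightarrow[k \to \infty]{} E_\theta(\phi_n) = \frac{\theta + 1}{n\theta + 1} .$$

Assuming the observation is $\hat{\psi}_{n,k} = \psi$, $(1 - \psi)/(n\psi - 1)$ is also a consistent estimator of $\theta$ by application of the 
asymptotic method of moments. Note that: 

$$\psi = \sum_{q=1}^{P} \frac{(B_q/k)^2}{1 - (1 - B_q/k)^k} ,$$

where $B_q \geq 1$ is an observed realization of $\mathcal{B}_{n,k}(q)$ for any $q \in \{1, \ldots, P\}$. Hence it involves the observations 
$P$ and $B_q$ with $q \in \{1, \ldots, P\}$. Therefore an alternative closely related to the two previous estimators (see 
equations (18) and (19)) for the pair $(\theta, n)$ could be: 

$$\hat{\theta}_2 = \frac{1 - \psi}{\hat{n}_2 \psi - 1} \quad \text{and} \quad \hat{n}_2 = P + \hat{n}_2 \frac{((\hat{n}_2 - 1)\hat{\theta}_2)_k}{(\hat{n}_2 \hat{\theta}_2)_k} ,$$

or: 

$$\tilde{\theta}_2 = \frac{1 - \psi}{\tilde{n}_2 \psi - 1} \quad \text{and} \quad \tilde{n}_2 = P + \frac{B_{k,P-1}(\tilde{\theta}_2)}{B_{k,P}(\tilde{\theta}_2)} .$$

These estimators are based on the set of observations $P$ and $B_q$, $q \in \{1, \ldots, P\}$, and depend on which estimator 
of $n$ itself was chosen. 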
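The coverage-adjusted Simpson statistic only requires the observed occupancies. A minimal sketch follows, assuming the Horvitz-Thompson-type form in which each observed relative frequency $B_q/k$ contributes $(B_q/k)^2$ divided by its inclusion probability $1 - (1 - B_q/k)^k$. The function names are illustrative, and the plug-in Simpson index is included for comparison.

```python
def coverage_adjusted_simpson(counts):
    """psi = sum_q (B_q/k)^2 / (1 - (1 - B_q/k)^k), with k = sum_q B_q."""
    k = sum(counts)
    total = 0.0
    for b in counts:
        s = b / k
        # Each observed species is weighted by the inverse of its
        # estimated probability of appearing in a k-sample.
        total += s * s / (1.0 - (1.0 - s) ** k)
    return total


def plug_in_simpson(counts):
    """Naive plug-in value sum_q (B_q/k)^2, for comparison."""
    k = sum(counts)
    return sum((b / k) ** 2 for b in counts)


counts = [2, 2]             # hypothetical occupancies (k = 4, P = 2)
print(coverage_adjusted_simpson(counts), plug_in_simpson(counts))
```

Since every inclusion probability is at most one, the adjusted value is never smaller than the plug-in value; the two coincide when a single species fills the whole sample.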

3.3 Stopping rules 

We now define three stopping rules for the sampling process. Indeed there is no "objective" stopping rule. Each of these stopping rules is based on a simple and interesting question that arises naturally in our context. What is the sample size until the first visit to the smallest fragment? How long should one wait until all fragments have been visited (the coupon collector problem)? When does the probability of discovering a new species become smaller than a given threshold? Clearly the first two questions concern only the situation with $n < \infty$ while the last can be answered in any case.

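1. Let $S_{(n)}$ be the smallest fragment among $S_n$. Let $K_{(n)}$ be the sample size until the first visit to $S_{(n)}$. Then

$$P(K_{(n)} > k \mid S_n) = (1 - S_{(n)})^k$$

is the conditional distribution of the waiting time until the first visit to this fragment. Averaging over the partitions $S_n$, we obtain

$$P_\theta(K_{(n)} > k) = E_\theta P(K_{(n)} > k \mid S_n) = E_\theta\left[(1 - S_{(n)})^k\right] = \int_0^1 P_\theta\left((1 - S_{(n)})^k > s\right) ds = 1 - \int_0^1 P_\theta\left(S_{(n)} > 1 - s^{1/k}\right) ds .$$

To evaluate this probability, we thus need to compute the distribution of $S_{(n)}$. We can prove

$$P_\theta(S_{(n)} > s) = \frac{\Gamma(n\theta)}{\Gamma(\theta)^n}\, \phi_{n,\theta}(s) ,$$

where $\phi_{n,\theta}(s) = h_{\theta,s}^{*n}(t)\big|_{t=1}$ is the $n$-fold convolution of $t \mapsto h_{\theta,s}(t) = t^{\theta-1} I(1 > t > s)$ evaluated at $t = 1$. This distribution could be computed in closed form. In the Bose-Einstein case ($\theta = 1$), the expression simplifies to

$$P_1(S_{(n)} > s) = (1 - ns)_+^{n-1} ,$$

where $x_+ = x \vee 0$ (see Huillet, 2003). As a result, with $k \geq 1$,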



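$$\begin{aligned}
P_1(K_{(n)} > k) &= 1 - \int_0^1 \left(1 - n\left(1 - s^{1/k}\right)\right)_+^{n-1} ds \\
&= 1 - \int_{(1 - 1/n)^k}^{1} \left(1 - n\left(1 - s^{1/k}\right)\right)^{n-1} ds \\
&= 1 - \frac{k}{n} \int_0^1 (1 - x)^{n-1} \left(1 - \frac{x}{n}\right)^{k-1} dx \\
&= 1 - \frac{k}{n} \sum_{j=0}^{k-1} \binom{k-1}{j} \left(-\frac{1}{n}\right)^{j} \frac{j!\,(n-1)!}{(n+j)!} .
\end{aligned}$$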



2. Let $K^+ = \inf\{k \geq n \,;\, P_{n,k} = n\}$ be the first time that all species are observed in the sample. With $k \geq n$, we have

$$P_\theta(K^+ > k) = P_\theta(P_{n,k} < n) = 1 - P_\theta(P_{n,k} = n) .$$

Recalling equation (||), we obtain



ri-l 
Recalling equation (gj), with n, this may also be written as follows:

P,(if+ ^ fc) = ——BkAO) ■ 
{n0)k 

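When $\theta = 1$ and $\theta \uparrow \infty$, this formula further simplifies to give respectively the well-known results:

$$P_1(K^+ \leq k) = \binom{k-1}{n-1} \Big/ \binom{n+k-1}{k}$$

and:

$$P_\infty(K^+ \leq k) = \sum_{j=0}^{n} (-1)^j \binom{n}{j} \left(1 - \frac{j}{n}\right)^k .$$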

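3. A last possible stopping rule for the sample is the following: proceeding with sampling is useless if the estimated probability to obtain a new species $\hat r_1$ (or $\tilde r_1$) is less than some small value $\varepsilon$ (say $\varepsilon = 0.01$ for instance). Hence we are interested in the two following sample sizes:

$$\hat K_\varepsilon = \inf\{k \,;\, \hat r_1 < \varepsilon\} \quad \text{and} \quad \tilde K_\varepsilon = \inf\{k \,;\, \tilde r_1 < \varepsilon\} ,$$

using respectively $\hat n$ or $\tilde n$ for estimating $n$. When $P$ distinct species have been observed at step $k$, the probability to get a new species at the $(k+1)$-th trial is

$$r_1 = \frac{(n - P)\theta}{n\theta + k} .$$

Using estimators developed previously, we obtain the two following estimates for $r_1$:

$$\hat r_1 = \frac{(\hat n - P)\theta}{\hat n\theta + k} \quad \text{and} \quad \tilde r_1 = \frac{(\tilde n - P)\theta}{\tilde n\theta + k}$$

(if $\theta$ is also unknown, one could replace it by its estimate). When $\theta = 1$ (Bose-Einstein case), the explicit expressions for $\hat n$ and for $\tilde n$ lead to:

$$\hat r_1 = \frac{\hat n - P}{\hat n + k} = \frac{P(P-1)}{k^2 - P} \quad \text{and} \quad \tilde r_1 = \frac{\tilde n - P}{\tilde n + k} = \frac{P(P-1)}{k^2 + 1} .$$

Obviously $\hat r_1 \geq \tilde r_1$. As a consequence, $\hat K_\varepsilon \geq \tilde K_\varepsilon$. Thus, $\hat K_\varepsilon$ is of order $P\varepsilon^{-1/2}$ and $\tilde K_\varepsilon$ is of order $(P(P-1))^{1/2}\varepsilon^{-1/2}$. Hence if $P$ is large enough, $\hat K_\varepsilon$ and $\tilde K_\varepsilon$ are of the same order. In the case of Kingman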
model, we have only an explicit expression for n. In such case, let us recall that: 

~ Sfe~l,P-l 

n = ■ — , 

which could be evaluated from inspection of a table of Stirling numbers of the first kind.
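The three stopping rules above can be explored numerically. The Python sketch below is illustrative only and rests on stated assumptions: `prob_smallest_unvisited` uses the series obtained by integrating $P_1(S_{(n)} > s) = (1 - ns)_+^{n-1}$ term by term (our own rearrangement), the two `prob_all_seen_*` functions use the classical occupancy formulas for $\theta = 1$ and for $n$ equal fragments, and `prob_new_species` is the expression of $r_1$. All function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def prob_smallest_unvisited(n, k):
    """P_1(K_(n) > k): the smallest of n fragments is still unseen after k draws (theta = 1)."""
    total = sum(
        Fraction((-1) ** j * comb(k - 1, j), n ** j * comb(n + j, j))
        for j in range(k)
    )
    return 1 - Fraction(k, n * n) * total


def prob_all_seen_theta_one(n, k):
    """P_1(K+ <= k): all n fragments have been seen after k draws when theta = 1."""
    if k < n:
        return Fraction(0)
    return Fraction(comb(k - 1, n - 1), comb(n + k - 1, k))


def prob_all_seen_equal_fragments(n, k):
    """Limit theta -> infinity (n equal fragments): classical coupon collector."""
    return sum(
        Fraction((-1) ** j * comb(n, j) * (n - j) ** k, n ** k)
        for j in range(n + 1)
    )


def prob_new_species(n, p, theta, k):
    """r_1 = (n - p) * theta / (n * theta + k), with p species seen after k draws."""
    return (n - p) * theta / (n * theta + k)


if __name__ == "__main__":
    print(prob_smallest_unvisited(2, 2))        # 7/12
    print(prob_all_seen_theta_one(2, 2))        # 1/3
    print(prob_all_seen_equal_fragments(2, 2))  # 1/2
    print(prob_new_species(200, 120, 1.0, 300))
```

Exact rational arithmetic is used for the alternating series because its terms cancel heavily; for large $k$ a floating-point evaluation of the same sum would lose accuracy.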

3.4 Goodness of fit using the second Ewens sampling formula

Deciding which model best fits a concrete situation is a challenging problem. This can first be appreciated from the likelihood of the observations under the different models to be compared. We shall recall an additional procedure followed by Keener et al. (1987) for the case $n < \infty$. First, a simple computation of $a_i = E_\theta(A_{n,k}(i))$ gives, using our notations



k-i:l 



According to theorem 2.5 in Keener et al. (1987), a UMVU estimator of $a_i$ is obtained under the form:

(i-l)! Bk,p{9) 

When $\theta = 1$ (Bose-Einstein case), it becomes:

$$\tilde a_i = \frac{p(p-1)}{k-i-p+1} \cdot \frac{(k-i-1)!\,(k-p)!}{(k-i-p)!\,(k-1)!} ,$$






Thierry Huillet and Christian Paroissin 



recalling the expression of p(l). Define next the MLE of $a_i$ to be:

Based on the observations $A_i$ of $A_{n,k}(i)$, the goodness of fit of the model can be measured by one of the two following statistics:

$$\sum_{i=1}^{k} (A_i - \tilde a_i)^2 / \tilde a_i \quad \text{or} \quad \sum_{i=1}^{k} (A_i - \hat a_i)^2 / \hat a_i .$$

In the case of the Kingman model, one can check that: 

k\ 7 



a, = E;(A(i)) 



i{k — i)! 7 + i — 1 
and the second statistic becomes: 

fc 

1=1 

where Si = where 7 is given by equation (|l5|). 
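To make the goodness-of-fit procedure concrete, here is a minimal Python sketch for the Bose-Einstein case only. It assumes that, given $p$ observed species, all ways of splitting the $k$ observations into $p$ non-empty groups are equally likely, so that the conditional expectation of the number of species seen exactly $i$ times is $p \binom{k-i-1}{p-2} / \binom{k-1}{p-1}$; this is our own derivation, used here as an illustration rather than as the authors' code. The function names are hypothetical, and the sum is restricted to $i \leq k - p + 1$, where the expected counts are positive.

```python
from math import comb


def expected_counts_theta_one(k, p):
    """E[A(i) | P = p] for i = 1..k-p+1 under theta = 1 (requires 2 <= p <= k)."""
    denom = comb(k - 1, p - 1)           # number of ways to split k into p positive parts
    return {
        i: p * comb(k - i - 1, p - 2) / denom
        for i in range(1, k - p + 2)
    }


def chi_square(observed, expected):
    """Sum of (A_i - a_i)^2 / a_i over the classes with positive expectation."""
    return sum(
        (observed.get(i, 0) - a) ** 2 / a
        for i, a in expected.items()
    )


if __name__ == "__main__":
    # Madison data (q >= 1): number of manuscripts in which 'may' occurs q times.
    madison = {1: 63, 2: 29, 3: 8, 4: 4, 5: 1, 6: 1}
    k = sum(q * a for q, a in madison.items())
    p = sum(madison.values())
    expected = expected_counts_theta_one(k, p)
    print(k, p, chi_square(madison, expected))
```

Two sanity checks are built into the construction: the expected counts sum to $p$, and the expected counts weighted by $i$ sum to $k$.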

3.5 Numerical simulations 

We now apply the estimators developed and discussed previously to simulated data, in order to observe how their quality behaves when $n$, $\theta$ and $k$ vary.

We consider the empirical distribution of $k$-samples for a given Dirichlet partition (however one could rather prefer to consider the empirical distribution of Dirichlet partitions and one $k$-sample). For $n \in \{100; 200; 500\}$ and $\theta$ ∈ {i; 1; |}, we simulated a Dirichlet partition. Over this partition, we simulated 500 $k$-samples with $k \in \{2n/3; n; 3n/2\}$. Note that we managed to use the same uniform random variables, so that the observations of a smaller sample coincide with the first observations of a larger one, and so on. We considered the two different cases: $\theta$ known and $\theta$ unknown. In some cases, we did not use one of the two estimators of $n$ since computations were too heavy.

Tables 1 to 3 contain the estimations (first when $\theta$ is known and then when $\theta$ is unknown) respectively for $\theta = 1$, $\theta$ = ^ and $\theta$ = |. The numbers that appear in the cells are the empirical averages of the estimations of $n$ or $\theta$, and the numbers in brackets within the cells are the empirical standard deviations (over 100 $k$-samples as described above). Note that comparing standard deviations for the estimations of $n$ and $\theta$ does not make sense (one should rather use for instance the coefficient of variation, which is dimensionless).

For $\theta = 1$ (Table 1), the results are very good, even when $\theta$ is unknown (except for the estimation of $\theta$ with the statistic $\psi_{n,k}$). For $\theta$ = ^ and $\theta$ = |, we did not run one of the two estimators of $n$ for the reason given above. Results are quite good when $\theta$ is known, but not so good when $\theta$ is unknown.
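The simulation design described above (one Dirichlet partition, repeated nested $k$-samples) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' code. The estimator $P(k-1)/(k-P)$ used for $\theta = 1$ is the value of $n$ implied by solving the $\theta = 1$ expression of the first estimate of $r_1$ in Section 3.3; that choice, the seed and all function names are assumptions of this example.

```python
import random
from statistics import mean, stdev


def dirichlet_partition(n, theta, rng):
    """Symmetric Dirichlet(theta) partition of [0, 1] via normalized gamma variables."""
    gammas = [rng.gammavariate(theta, 1.0) for _ in range(n)]
    total = sum(gammas)
    return [g / total for g in gammas]


def nested_distinct_counts(partition, sizes, rng):
    """Draw max(sizes) fragments once; P_{n,k} for each k uses the first k draws."""
    draws = rng.choices(range(len(partition)), weights=partition, k=max(sizes))
    return {k: len(set(draws[:k])) for k in sizes}


def n_hat_theta_one(k, p):
    """Estimate of n for theta = 1, assuming p < k (illustrative choice)."""
    return p * (k - 1) / (k - p)


if __name__ == "__main__":
    rng = random.Random(2024)
    n, theta = 100, 1.0
    sizes = [2 * n // 3, n, 3 * n // 2]
    partition = dirichlet_partition(n, theta, rng)
    estimates = {k: [] for k in sizes}
    for _ in range(500):
        for k, p in nested_distinct_counts(partition, sizes, rng).items():
            if p < k:  # the estimate is undefined when every draw is a new species
                estimates[k].append(n_hat_theta_one(k, p))
    for k in sizes:
        print(k, round(mean(estimates[k]), 2), round(stdev(estimates[k]), 2))
```

Reusing the same draws for all three sample sizes mirrors the nesting described in the text, so the counts $P_{n,k}$ are non-decreasing in $k$ within each repetition.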

[Tab. 1 about here.] 

[Tab. 2 about here.] 

[Tab. 3 about here.] 



4 Applications to real data 

We applied our estimators to fourteen different sets of real data. These data are of various natures, as we will see later. However we will only consider here two kinds of real data sets. The first one was studied by Keener et al. (1987) and deals with word usage by two different authors. The interest of this first data set is that we actually know the number $n$ to be estimated. The second one was extracted from observations made by Janzen (1973): these data correspond to beetle species observed either during the day or during the night and in different seasons.

4.1 Federalist papers data

These data were considered by Mosteller and Wallace (1984) and concern word usage by James Madison and Alexander Hamilton. The Federalist papers were written between 1787 and 1788 to promote the ratification of the new Constitution of the United States in the State of New York. Published in various newspapers, these papers were signed under a pseudonym (as for instance Publius). Each paper was written by one of the three following persons: James Madison, Alexander Hamilton and John Jay. The author of most of the seventy-seven papers is clearly identified but Madison and Hamilton disputed the authorship of twelve of them. Hence in order to determine the author of these disputed papers, many researchers studied papers known to have been written by Madison and Hamilton. In particular some of them focused on the occurrences of function words as defined by the Miller-Newman-Friedman list. Mosteller and Wallace (1984) developed a Bayesian approach to solve this problem. In order to do so, they divided a set of well identified texts (either by Madison or by Hamilton) into pieces of equal length. This corresponds to the data presented below.

For $q \in \{0, \dots, 6\}$, the two tables below give the number $A(q)$ of manuscripts (of the same type as those published in The Federalist and with comparable length) in which a specific word ('may' for Madison and 'can' for Hamilton) occurs exactly $q$ times (these two words were the ones selected by Keener et al. (1987), but Mosteller and Wallace (1984) studied more words). In this case, the exact numbers of manuscripts are known (respectively $n = 262$ and $n = 247$) and so we will be able to compare them with our estimations. Since we have this additional information, Tables 4 and 5 contain the column for $q = 0$, which is not available in a real applied context.

For these two data sets, Keener et al. (1987) computed the estimation of $n$ for the three special cases considered here (i.e. for the Bose-Einstein, Maxwell-Boltzmann and Kingman models) and also for the case where both $n$ and $\theta$ are unknown. Results are compared through the log-likelihood. Indeed, in their paper, there is no theoretical development when $n$ and $\theta$ are unknown.

• Madison data: the sample size is $k = 172$ and the number of distinct kinds of manuscripts is $p = 106$. When using the statistic $D_{n,k}$, we obtain $\hat n_1 = 274.6$ and $\hat\theta_1 = 1.09$. When using the statistic $\psi_{n,k}$, we obtain $\hat n_2 = 274.6$ and $\hat\theta_2 = 0.32$. When assuming that both $n$ and $\theta$ are unknown, Keener et al. (1987) obtained respectively 217 and 1.998 as estimated values. This value for $n$ is far from its correct value.

[Tab. 4 about here.] 

• Hamilton data: the sample size is $k = 139$ and the number of distinct kinds of manuscripts is $p = 90$. When using the statistic $D_{n,k}$, we obtain $\hat n_1 = 253.5$ and $\hat\theta_1 = 0.85$. When using the statistic $\psi_{n,k}$, we obtain $\hat n_2 = 4526.3$ and $\hat\theta_2 = 0.01$. The second value is unsatisfactory. However when assuming that both $n$ and $\theta$ are unknown, Keener et al. (1987) obtained respectively 10,000,001 and 1.094 x 10~^ as estimated values! This value for $n$ is very far from its correct value.

[Tab. 5 about here.] 
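The sample summaries quoted above can be checked directly from the frequency tables. The short Python sketch below recomputes $k$ and $p$ from the columns $q \geq 1$ of the Madison and Hamilton tables, and evaluates $P(k-1)/(k-P)$. This closed form is not stated in this section: it is what one obtains by solving the $\theta = 1$ expression of the first estimate of $r_1$ in Section 3.3 for $n$, and we merely observe that it reproduces the reported values of $\hat n_1$. Function names are hypothetical.

```python
def sample_summary(freq):
    """freq[q] = number of units observed exactly q times (q >= 1); returns (k, p)."""
    k = sum(q * a for q, a in freq.items())   # total number of observations
    p = sum(freq.values())                    # number of distinct observed units
    return k, p


def n_hat_theta_one(k, p):
    """Value of n solving (n - p) / (n + k) = p (p - 1) / (k^2 - p)."""
    return p * (k - 1) / (k - p)


if __name__ == "__main__":
    madison = {1: 63, 2: 29, 3: 8, 4: 4, 5: 1, 6: 1}
    hamilton = {1: 60, 2: 20, 3: 5, 4: 2, 5: 2, 6: 1}
    for name, freq in (("Madison", madison), ("Hamilton", hamilton)):
        k, p = sample_summary(freq)
        print(name, k, p, round(n_hat_theta_one(k, p), 1))
    # Madison 172 106 274.6
    # Hamilton 139 90 253.5
```

Adding the $q = 0$ column to $p$ gives back the known totals, $106 + 156 = 262$ and $90 + 157 = 247$.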

4.2 Tropical insect data 

Janzen (1973) observed tropical insects in twenty-five different sites in Costa Rica and the Caribbean Islands. This paper contains a remarkable collection of data. From it, we extracted three series corresponding to beetles collected either in day-time or in night-time, all during a dry season. These data were collected at the same site, referred to as "Osa secondary" in Janzen (1973). Observations of the first series were collected during the dry season of the year 1967 in day-time while the ones of the second series were collected during the same period in night-time. Finally, observations of the last series were collected during the dry season of the year 1968 in day-time.

• Osa secondary/day/dry/1967: $k = 996$ beetles and $p = 140$ distinct species were observed. When using the statistic $D_{n,k}$, we obtain $\hat n_1 = 162.7$ and $\hat\theta_1 = 0.219$. When using the statistic $\psi_{n,k}$, we obtain $\hat n_2 = 162.7$ and $\hat\theta_2 = 0.211$.

[Tab. 6 about here.] 

• Osa secondary/night/dry/1967: $k = 835$ beetles and $p = 151$ distinct species were observed. When using the statistic $D_{n,k}$, we obtain $\hat n_1 = 184.1$ and $\hat\theta_1 = 0.268$. When using the statistic $\psi_{n,k}$, we obtain $\hat n_2 = 184.1$ and $\hat\theta_2 = 0.252$.

[Tab. 7 about here.] 

• Osa secondary/day/dry/1968: $k = 807$ beetles and $p = 143$ distinct species were observed. When using the statistic $D_{n,k}$, we obtain $\hat n_1 = 173.6$ and $\hat\theta_1 = 0.111$. When using the statistic $\psi_{n,k}$, we obtain $\hat n_2 = 173.6$ and $\hat\theta_2 = 0.108$.
[Tab. 8 about here.]

For all three data sets, the two values of $n$ are identical. Moreover the two values of $\theta$ are close, which is not always the case. It may be due to the fact that many species are abundant.

4.3 Conclusion 

These two families of data sets illustrate the results obtained when applying the estimators developed in this paper. They also show their computational limits. In particular one can observe that the values of the two estimators $\hat n_1$ and $\hat n_2$ of $n$ (respectively based on $D_{n,k}$ and $\psi_{n,k}$) may differ. This should arise especially when most species are rare and only a few species are abundant. However, over the fourteen sets of real data we used, this situation occurs four times. Estimations for the three data sets about tropical beetles seem to be exceptionally satisfactory. It may be due to the presence of many abundant species.
            $\theta$ known                  $\theta$ unknown
                                            based on $D_{n,k}$              based on $\psi_{n,k}$
n     k     $\hat n$        $\tilde n$      $\hat n_1$      $\hat\theta_1$  $\hat n_2$      $\hat\theta_2$
100   66    92.96           90.98           1.2E+09         0.21            9.8E+08         2.7E-08
            (16.51)         (15.80)         (1.3E+09)       (0.32)          (3.5E+08)       (1.1E-08)
      100   [illegible]     [illegible]     2821.86         0.21            3322.34         0.01
            (13.29)         (13.05)         (1884.31)       (0.33)          (12632.13)      (0.01)
      150   [illegible]     [illegible]     91.83           1.13            130.06          0.61
            (9.62)          (9.55)          (9.63)          (0.23)          (268.89)        (0.12)
200   133   206.76          204.33          223.08          1.09            4731.03         0.01
            (31.23)         (30.46)         (153.28)        (0.23)          (767.15)        (0.001)
      200   [illegible]     [illegible]     201.22          1.11            201.22          0.46
            [illegible]     (17.89)         (18.07)         (0.18)          (18.07)         (0.04)
      300   [illegible]     [illegible]     200.30          1.09            200.30          0.58
            (13.28)         (13.22)         (13.28)         (0.15)          (13.28)         (0.05)
500   333   513.34          510.96          513.34          0.99            513.34          0.32
            (41.89)         (41.49)         (41.89)         (0.13)          (41.89)         (0.02)
      250   515.20          514.14          515.20          0.99            515.20          0.43
            (27.79)         (27.68)         (27.79)         (0.13)          (27.79)         (0.03)
      750   515.15          514.68          515.15          0.98            515.15          0.54
            (21.61)         (21.57)         (21.61)         (0.10)          (21.61)         (0.03)

Tab. 1: Estimation over simulated data with $\theta = 1$
            $\theta$ known  $\theta$ unknown
                            based on $D_{n,k}$              based on $\psi_{n,k}$
n     k     $n$ estimate    $\hat n_1$      $\hat\theta_1$  $\hat n_2$      $\hat\theta_2$
100   66    108.21          8.2E+08         0.16            7.5E+08         3.2E-08
            (19.39)         (8.5E+08)       (0.30)          (2.7E+08)       (1.3E-08)
      100   [illegible]     1418.81         0.29            1708.36         0.02
            [illegible]     (1547.26)       (0.32)          (968.16)        (0.02)
      150   162.34          156.51          0.91            2462.54         0.38
            [illegible]     (614.64)        (0.23)          (16084.16)      (0.29)
200   133   136.75          4.2E+08         0.33            5.9E+08         0.002
            (84.83)         (6.9E+08)       (0.41)          (3.8E+08)       (0.005)
      200   [illegible]     836.19          0.42            1222.67         0.11
            [illegible]     (1328.58)       (0.38)          (1371.98)       (0.19)
      300   [illegible]     473.93          0.69            2828.47         0.19
            [illegible]     (3672.28)       (0.310)         (30533.87)      (0.26)
500   333   232.11          4.2E+08         0.35            5.9E+08         0.07
            (270.59)        (6.9E+08)       (0.39)          (3.8E+08)       (0.15)
      250   234.31          882.81          0.41            1269.30         0.11
            (272.37)        (1308.03)       (0.37)          (1338.81)       (0.18)
      750   277.95          521.22          0.68            2875.75         0.19
            (256.56)        (3669.47)       (0.30)          (30529.92)      (0.25)

Tab. 2: Estimation over simulated data with $\theta$ = [illegible]












            $\theta$ known  $\theta$ unknown
                            based on $D_{n,k}$              based on $\psi_{n,k}$
n     k     $n$ estimate    $\hat n_1$      $\hat\theta_1$  $\hat n_2$      $\hat\theta_2$
100   66    99.39           7.801E+08       0.18            8.299E+08       0.0001
            (19.49)         (1.167E+09)     (0.28)          (3.365E+08)     (0.003)
      100   [illegible]     1740.15         [illegible]     [illegible]     0.01
            [illegible]     [illegible]     (0.30)          (1331.18)       (0.01)
      150   155.01          465.64          0.74            [illegible]     0.0001
            [illegible]     [illegible]     (0.37)          [illegible]     (0.25)
200   133   151.57          4.195E+08       0.40            5.922E+08       0.002
            (110.93)        (6.926E+08)     (0.47)          (3.830E+08)     (0.005)
      200   151.08          853.30          0.47            1239.78         0.09
            (107.15)        (1320.33)       (0.45)          (1359.18)       (0.15)
      300   189.47          488.62          0.75            2843.15         0.19
            (81.03)         (3671.15)       (0.38)          (30532.61)      (0.25)
500   333   254.38          4.2E+08         0.37            5.9E+08         0.05
            (314.97)        (6.9E+08)       (0.41)          (3.8E+08)       (0.11)
      250   253.11          921.32          0.44            1307.81         0.09
            (309.82)        (1295.86)       (0.40)          (1315.75)       (0.15)
      750   291.79          556.83          0.71            2911.36         0.18
            (284.10)        (3668.96)       (0.33)          (30527.13)      (0.24)

Tab. 3: Estimation over simulated data with $\theta$ = [illegible]



Estimating the number of species 











q       0     1     2     3     4     5     6
A(q)    156   63    29    8     4     1     1

Tab. 4: Madison data











q       0     1     2     3     4     5     6
A(q)    157   60    20    5     2     2     1

Tab. 5: Hamilton data






q       1     2     3     4     5     6     7     8     9     10    11    12    14
A(q)    70    17    4     5     5     5     5     3     1     2     3     2     2

q       17    29    20    21    24    26    40    57    60    64    71    77
A(q)    1     2     3     1     1     1     1     2     1     1     1     1

Tab. 6: Osa secondary/day/dry/1967 data






q       1     2     3     4     5     7     8     9     10    11    12
A(q)    61    24    13    12    5     6     5     2     4     2     3

q       13    15    17    18    19    26    30    33    40    44    62
A(q)    1     1     1     2     2     1     1     1     1     1     2

Tab. 7: Osa secondary/night/dry/1967 data






q       1     2     3     4     5     6     7     9     10    11    12    13
A(q)    85    12    10    4     6     3     5     1     2     1     1     1

q       15    18    20    24    25    28    29    30    79    106   112
A(q)    1     2     1     1     1     1     1     1     1     1     1

Tab. 8: Osa secondary/day/dry/1968



